Collaborative Statistics Using R 



By: 

Ananda Mahto 






Online: 

<http://cnx.org/content/col11219/1.7/>



CONNEXIONS 

Rice University, Houston, Texas 



This selection and arrangement of content as a collection is copyrighted by Ananda Mahto. It is licensed under the 

Creative Commons Attribution 3.0 license (http://creativecommons.org/licenses/by/3.0/).

Collection structure revised: January 4, 2011 

PDF generated: February 6, 2011 

For copyright and attribution information for the modules contained in this collection, see p. 37. 



Table of Contents

Preface

1 Sampling and Data
    1.1 Introduction
    1.2 Key Terms
    1.3 Data
    1.4 Sampling
    1.5 Variation and Critical Evaluation
    1.6 Frequency, Relative Frequency, and Cumulative Frequency
    1.7 Practice
    Solutions

2 Descriptive Statistics
    2.1 Introduction
    2.2 Stem and Leaf Graphs and Bar Graphs
    2.3 Histograms
    2.4 Box Plots
    Solutions

Glossary

Index

Attributions






Preface 



About "Collaborative Statistics" 

Collaborative Statistics was written by Barbara Illowsky and Susan Dean, faculty members at De Anza 
College in Cupertino, California. 

The original preface to the book, as written by Professors Illowsky and Dean, now follows:
This book is intended for introductory statistics courses being taken by students at two- and four-year
colleges who are majoring in fields other than math or engineering. Intermediate algebra is the only
prerequisite. The book focuses on applications of statistical knowledge rather than the theory behind it.
The text is named Collaborative Statistics because students learn best by doing. In fact, they learn best by 
working in small groups. The old saying "two heads are better than one" truly applies here. 
Our emphasis in this text is on four main concepts: 

• thinking statistically 

• incorporating technology 

• working collaboratively 

• writing thoughtfully 

These concepts are integral to our course. Students learn best by actively participating, not by just
watching and listening. Teaching should be highly interactive. Students need to be thoroughly engaged 
in the learning process in order to make sense of statistical concepts. Collaborative Statistics provides 
techniques for students to write across the curriculum, to collaborate with their peers, to think statistically, 
and to incorporate technology. 

This book takes students step by step. The text is interactive. Therefore, students can immediately apply 
what they read. Once students have completed the process of problem solving, they can tackle interesting 
and challenging problems relevant to today's world. The problems require the students to apply their 
newly found skills. The book also contains labs that use real data and practices that lead students step by 
step through the problem solving process. 

About this custom edition of Collaborative Statistics 

This custom edition of Collaborative Statistics is designed for use in a short course in introductory statistics. 
Additionally, the text includes examples of how to use the R-project open-source statistical package for the 
calculations. 

R software was chosen for several reasons. First, it is free. Second, it is relatively easy to learn once
you actually start using it. Third, the software is stable and quite advanced; there are many features
implemented in R that are not found in commercial software packages. Fourth, R has great community support;
if you have questions, there are numerous user groups that can help you solve your
problems.



1 This content is available online at <http://cnx.org/content/m35060/1.1/>.



We hope that you enjoy the process of learning about statistics and simultaneously learning how to use 
R. 



Chapter 1 

Sampling and Data 



1.1 Introduction 1 

1.1.1 Student Learning Objectives 

By the end of this chapter, the student should be able to: 

• Recognize and differentiate between key terms. 

• Apply various types of sampling methods to data collection. 

• Create and interpret frequency tables. 

• Apply some basic functions in R to generate samples and to summarize data.

The R functions you will be using in this chapter are cumsum(), cut(), length(), sample(), set.seed(),
and table().

1.1.2 Introduction 

You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or 
watch a news program on television, you are given sample information. With this information, you may 
make a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make 
the "best educated guess." 

Since you will undoubtedly be given statistical information at some point in your life, you need to 
know some techniques to analyze the information thoughtfully. Think about buying a house or managing 
a budget. Think about your chosen profession. The fields of economics, business, psychology, education, 
biology, law, computer science, police science, and early childhood development require at least a basic 
understanding of statistics. 

Included in this chapter are the basic ideas of statistics. You will also learn how data are gathered and 
what "good" data are. Additionally, you will be introduced to some very basic functions in R to help you 
work more efficiently. 

1.2 Key Terms 2 

In statistics, we generally want to study a population. You can think of a population as an entire collection
of persons, things, or objects under study. To study the larger population, we select a sample. The idea of
sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to
gain information about the population. Data are the result of sampling from a population.

1 This content is available online at <http://cnx.org/content/m35069/1.2/>.
2 This content is available online at <http://cnx.org/content/m35062/1.1/>.

Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make
sense to select a sample of students who attend the school. The data collected from the sample would be
the students' grade point averages. In presidential elections, opinion poll samples of 1,000 to 2,000 people
are taken. The opinion poll is supposed to represent the views of the people in the entire country.
Manufacturers of canned carbonated drinks take samples to determine if a 16-ounce can contains 16 ounces of
carbonated drink.

From the sample data, we can calculate a statistic. A statistic is a number that is a property of the 
sample. For example, if we consider one math class to be a sample of the population of all math classes, 
then the average number of points earned by students in that one math class at the end of the term is an 
example of a statistic. The statistic is an estimate of a population parameter. A parameter is a number that 
is a property of the population. Since we considered all math classes to be the population, then the average 
number of points earned per student over all the math classes is an example of a parameter. 

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. 
The accuracy really depends on how well the sample represents the population. The sample must contain 
the characteristics of the population in order to be a representative sample. We are interested in both the 
sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the 
sample statistic to test the validity of the established population parameter. 

A variable, notated by capital letters like X and Y, is a characteristic of interest for each person or 
thing in a population. Variables may be numerical or categorical. Numerical variables take on values 
with equal units such as weight in pounds and time in hours. Categorical variables place the person or 
thing into a category. If we let X equal the number of points earned by one math student at the end of a 
term, then X is a numerical variable. If we let Y be a person's party affiliation, then examples of Y include 
Republican, Democrat, and Independent. Y is a categorical variable. We could do some math with values 
of X (calculate the average number of points earned, for example), but it makes no sense to do math with 
values of Y (calculating an average party affiliation makes no sense). 

Data are the actual values of the variable. They may be numbers or they may be words. Datum is a 
single value. 

Two words that come up often in statistics are average and proportion. If you were to take three exams
in your math class and obtained scores of 86, 75, and 92, you would calculate your average score by adding the
three exam scores and dividing by three (your average score would be 84.3 to one decimal place). If, in your
math class, there are 40 students and 22 are men and 18 are women, then the proportion of men students
is 22/40 and the proportion of women students is 18/40. Average and proportion are discussed in more detail in
later chapters.
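
The average and the proportions described above can be checked with a few lines of R. This is only a minimal sketch: the object names (scores, men, women, and so on) are chosen here for illustration and do not come from the text.

```r
# Exam scores from the example in the text
scores <- c(86, 75, 92)

# Average: add the scores and divide by how many there are
average <- sum(scores) / length(scores)   # equivalent to mean(scores)
round(average, 1)                          # 84.3

# A class of 40 students: 22 men and 18 women
men <- 22
women <- 18
prop.men <- men / (men + women)       # 22/40 = 0.55
prop.women <- women / (men + women)   # 18/40 = 0.45
```

Note that the two proportions add up to 1, because every student in the class falls into exactly one of the two groups.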

Example 1.1 

Define the key terms from the following study: We want to know the average amount of money 
first year college students spend at ABC College on school supplies that do not include books. We 
randomly survey 100 first year students at the college. Three of those students spent $150, $200, 
and $225, respectively. 

Solution 

The population is all first year students attending ABC College this term. 

The sample could be all students enrolled in one section of a beginning statistics course at ABC 
College (although this sample may not represent the entire population). 

The parameter is the average amount of money spent (excluding books) by first year college 
students at ABC College this term. 

The statistic is the average amount of money spent (excluding books) by first year college 
students in the sample. 

The variable could be the amount of money spent (excluding books) by one first year student. 



Let X = the amount of money spent (excluding books) by one first year student attending ABC 
College. 

The data are the dollar amounts spent by the first year students. Examples of the data are $150, 
$200, and $225. 



1.3 Data 3 

Data may come from a population or from a sample. Small letters like x or y generally are used to represent 
data values. Most data can be put into the following categories: 

• Qualitative 

• Quantitative 

Qualitative data are the result of categorizing or describing attributes of a population. Hair color, blood 
type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. 
Qualitative data are generally described by words or letters. For instance, hair color might be black, dark 
brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Qualitative data are not as 
widely used as quantitative data because many numerical techniques do not apply to the qualitative data. 
For example, it does not make sense to find an average hair color or blood type. 

Quantitative data are always numbers and are usually the data of choice because there are many methods
available for analyzing the data. Quantitative data are the result of counting or measuring attributes of
a population. Amount of money, pulse rate, weight, number of people living in your town, and the number
of students who take statistics are examples of quantitative data. Quantitative data may be either discrete
or continuous.

All data that are the result of counting are called quantitative discrete data. These data take on only 
certain numerical values. If you count the number of phone calls you receive for each day of the week, you 
might get 0, 1, 2, 3, etc. 

All data that are the result of measuring are quantitative continuous data, assuming that we can measure
accurately. Measuring angles in radians might result in the numbers π/6, π/3, π/2, π, 3π/4, etc. If you and your
friends carry backpacks with books in them to school, the numbers of books in the backpacks are discrete
data and the weights of the backpacks are continuous data.

Example 1.2: Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You sample five students. 
Two students carry 3 books, one student carries 4 books, one student carries 2 books, and one 
student carries 1 book. The numbers of books (3, 4, 2, and 1) are the quantitative discrete data. 

Example 1.3: Data Sample of Quantitative Continuous Data 

The data are the weights of the backpacks with the books in them. You sample the same five students.
The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying 
three books can have different weights. Weights are quantitative continuous data because weights 
are measured. 

Example 1.4: Data Sample of Qualitative Data 

The data are the colors of backpacks. Again, you sample the same five students. One student has 
a red backpack, two students have black backpacks, one student has a green backpack, and one 
student has a gray backpack. The colors red, black, black, green, and gray are qualitative data. 
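
The three examples above could be entered in R roughly as follows. This is a sketch under assumptions: the object names are illustrative, and the order of the values within each vector is arbitrary, since the text does not say which student carried which backpack.

```r
# Example 1.2: quantitative discrete data (counts of books)
books <- c(3, 3, 4, 2, 1)

# Example 1.3: quantitative continuous data (weights in pounds)
weights <- c(6.2, 7, 6.8, 9.1, 4.3)

# Example 1.4: qualitative data, stored as a factor (R's type for categories)
colors <- factor(c("red", "black", "black", "green", "gray"))

# Averages make sense for quantitative data
mean(books)     # 2.6
mean(weights)   # 6.68

# For qualitative data we count how often each category occurs instead
table(colors)
```

Storing the colors as a factor tells R to treat them as categories, which is why table() is the natural summary for them and mean() is not.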



3 This content is available online at <http://cnx.org/content/m35064/1.1/>.




NOTE: You may collect data as numbers and report them categorically. For example, the quiz scores
for each student are recorded throughout the term. At the end of the term, the quiz scores are
reported as A, B, C, D, or F.
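
The cut() function is one way to turn numbers into categories in R. The sketch below is illustrative only: the six scores and the grade cutoffs (60, 70, 80, and 90) are assumptions made for this example and are not taken from the text.

```r
# Hypothetical end-of-term quiz scores (out of 100)
scores <- c(95, 82, 67, 74, 58, 90)

# Assumed cutoffs: below 60 is F, 60-69 is D, 70-79 is C, 80-89 is B, 90-100 is A
grades <- cut(scores,
              breaks = c(0, 60, 70, 80, 90, 100),
              labels = c("F", "D", "C", "B", "A"),
              right = FALSE,          # intervals are closed on the left: [60, 70)
              include.lowest = TRUE)  # so that a score of exactly 100 counts as an A

grades          # A B D C F A
table(grades)   # how many students received each letter
```

With right = FALSE, a score of exactly 90 falls in the A interval rather than the B interval, which matches the usual convention for grade boundaries.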



1.4 Sampling 4 

Gathering information about an entire population often costs too much or is virtually impossible. Instead, 
we use a sample of the population. A sample should have the same characteristics as the population it 
is representing. Most statisticians use various methods of random sampling in an attempt to achieve this 
goal. This section will describe a few of the most common methods. 

There are several different methods of random sampling. In each form of random sampling, each 
member of a population initially has an equal chance of being selected for the sample. Each method has 
pros and cons. The easiest method to describe is called a simple random sample. Two simple random 
samples contain members equally representative of the entire population. In other words, each sample of 
the same size has an equal chance of being selected. For example, suppose Lisa wants to form a four-person 
study group (herself and three other people) from her pre-calculus class, which has 33 members including 
Lisa. To choose a simple random sample of size 3 from the other members of her class, Lisa could put all 
the other 32 names in a hat, shake the hat, close her eyes, and pick out 3 names. A more technological way 
is for Lisa to first list the names of her classmates together with a two-digit number as shown below. 

Class Roster

ID   Name         ID   Name         ID   Name
01   Anselmo      12   Larry        23   Rowell
02   Bayani       13   Lizzy        24   Salangsang
03   Cheng        14   Macierz      25   Slade
04   Cuarismo     15   Motogawa     26   Stracher
05   Cuningham    16   Okimoto      27   Tallai
06   Fontecha     17   Patel        28   Tran
07   Hong         18   Price        29   Wai
08   Hoobler      19   Quizon       30   Wood
09   Jiao         20   Reyes        31   Yogi
10   Khan         21   Roquero      32   Zoe
11   King         22   Roth

Table 1.1



Lisa can either use a table of random numbers (found in many statistics books as well as mathematical
handbooks) or a calculator or computer to generate random numbers. For this example, suppose Lisa
chooses to generate random numbers by using R. She enters the statement sample(32, 3), where 32 is the
number of students in the class (excluding herself), and 3 is the number of students she wants to select. If she wants
her random sample to be replicable, she needs to set a seed for her sample by using the set.seed() function, as
demonstrated in the second example. When you try this exercise, you should get different results if you are
not using a seed value or if you are using a different seed value from the one in the example code. Results for the
same seed can also differ between versions of R, because the default sampling algorithm changed in R 3.6.0.



4 This content is available online at <http://cnx.org/content/m35067/1.1/>.



TIP: Use set.seed() whenever you want to be able to reproduce your results. You can, for instance,
set the seed to the date on which you first run your experiment. For example, if your
first experiment was done on August 1, 2010, you might use 20100801 as your seed. Every
time you re-run your experiment, use the same date from your original experiment as your seed,
and your output will be the same.



# A random sample of 3 from 32

> sample(32, 3)
[1] 19  3 23

# Using set.seed() to get a reproducible sample

> set.seed(123) # The seed can be any number you want

> sample(32, 3)
[1] 10 25 13

Using this information, Lisa will select the students with the ID numbers generated by R. 
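
Rather than looking up the ID numbers by hand, Lisa could store the class roster in R and let the sampled numbers pick out the names. The following is a sketch, with the object names roster, ids, and study.group chosen here for illustration; the names it selects depend on the seed and on your version of R.

```r
# The 32 classmates from Table 1.1, in ID order (position 1 is ID 01, and so on)
roster <- c("Anselmo", "Bayani", "Cheng", "Cuarismo", "Cuningham", "Fontecha",
            "Hong", "Hoobler", "Jiao", "Khan", "King", "Larry", "Lizzy",
            "Macierz", "Motogawa", "Okimoto", "Patel", "Price", "Quizon",
            "Reyes", "Roquero", "Roth", "Rowell", "Salangsang", "Slade",
            "Stracher", "Tallai", "Tran", "Wai", "Wood", "Yogi", "Zoe")

set.seed(123)                       # makes the selection reproducible
ids <- sample(length(roster), 3)    # three different ID numbers from 1 to 32
study.group <- roster[ids]          # the names that go with those IDs

ids
study.group
```

Because sample() draws without replacement by default, the three ID numbers are always different, so no classmate can be picked twice.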

Sometimes, it is difficult or impossible to obtain a simple random sample because populations are too 
large. Then we choose other forms of sampling methods that involve a chance process for getting the 
sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and 
the systematic sample. 

To choose a stratified sample, divide the population into groups called strata and then take a sample
from each stratum. For example, you could stratify (group) your college population by department
and then choose a simple random sample from each stratum (each department) to get a stratified random
sample. To choose a simple random sample from each department, number each member of the first
department, number each member of the second department and do the same for the remaining departments.
Then use simple random sampling to choose numbers from the first department and do the same for each
of the remaining departments. Those numbers picked from the first department, picked from the second
department and so on represent the members who make up the stratified sample.
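
A stratified sample like the one just described can be sketched in R. Everything in this example is hypothetical: the data frame students, the three department names, their sizes, and the choice of two students per department are made up for illustration.

```r
# A hypothetical college population of 30 students in three departments
students <- data.frame(
  id = 1:30,
  department = rep(c("Biology", "History", "Math"), times = c(12, 8, 10))
)

set.seed(20100801)

# Group the student IDs by department (one stratum per department)
ids.by.dept <- split(students$id, students$department)

# Take a simple random sample of 2 students from each stratum
picked <- lapply(ids.by.dept, function(ids) ids[sample(length(ids), 2)])

# Combine the picks into one stratified sample
stratified.sample <- students[students$id %in% unlist(picked), ]
table(stratified.sample$department)   # 2 students from every department
```

The code samples positions with sample(length(ids), 2) and then looks up the IDs, which avoids a known surprise in R: sample(x, ...) treats a single number x as the range 1 to x.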

To choose a cluster sample, divide the population into groups called clusters and then randomly select some of the
clusters. All the members from these clusters are in the cluster sample. For example, if you randomly sample
four departments from your college population, the four departments make up the cluster sample.
You could do this by numbering the different departments and then choosing four different numbers using
simple random sampling. All members of the four departments with those numbers are the cluster sample.
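
The department example can be sketched in R as well. The population below is hypothetical: ten made-up departments with six students each, stored in a data frame called students.

```r
# A hypothetical college population: 10 departments with 6 students each
students <- data.frame(
  id = 1:60,
  department = rep(paste("Dept", 1:10), each = 6)
)

set.seed(20100801)

# Randomly choose 4 of the 10 departments
departments <- unique(students$department)
chosen <- departments[sample(length(departments), 4)]

# Every student in a chosen department is part of the sample
cluster.sample <- students[students$department %in% chosen, ]

chosen
nrow(cluster.sample)   # 24: all 6 students from each of the 4 departments
```

Notice the contrast with stratified sampling: here the random choice is made among whole departments, and then everyone in the selected departments is included.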

To choose a systematic sample, randomly select a starting point and take every nth piece of data from a
listing of the population. For example, suppose you have to do a phone survey. Your phone book contains
20,000 residence listings. You must choose 400 names for the sample. Number the population 1 to 20,000
and then use a simple random sample to pick a number that represents the first name of the sample. Then
choose every 50th name thereafter until you have a total of 400 names (you might have to go back to the
beginning of your phone list). Systematic sampling is frequently chosen because it is a simple method.
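
The phone book example can be sketched in R using only the numbers given above (20,000 listings, 400 names, every 50th listing). The object names are illustrative, and the wrap-around step is one possible way of going back to the beginning of the list.

```r
population.size <- 20000
sample.size <- 400
step <- population.size / sample.size   # 50: take every 50th listing

set.seed(20100801)
start <- sample(population.size, 1)     # random starting listing

# Count forward in steps of 50 from the starting point ...
positions <- start + step * (0:(sample.size - 1))

# ... and wrap around to the beginning of the list when we pass listing 20,000
listing.numbers <- ((positions - 1) %% population.size) + 1

length(listing.numbers)   # 400
head(listing.numbers)
```

Because 400 steps of 50 cover the list exactly once, no listing is selected twice, whatever the starting point.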

A type of sampling that is nonrandom is convenience sampling. Convenience sampling involves using 
results that are readily available. For example, a computer software store conducts a marketing study by 
interviewing potential customers who happen to be in the store browsing through the available software. 
The results of convenience sampling may be very good in some cases and highly biased (favoring certain
outcomes) in others.

Sampling data should be done very carefully. Collecting data carelessly can have devastating results. 
Surveys mailed to households and then returned may be very biased (for example, they may favor a certain 
group). It is better for the person conducting the survey to select the sample respondents. 

In reality, simple random sampling should be done with replacement. That is, once a member is picked,
that member goes back into the population and thus may be chosen more than once. This is true random
sampling. However, for practical reasons, in most populations, simple random sampling is done without
replacement. That is, a member of the population may be chosen only once. Most samples are taken from
large populations and the sample tends to be small in comparison to the population. Since this is the case,
sampling without replacement is approximately the same as sampling with replacement because the chance
of picking the same individual more than once when sampling with replacement is very low.

For example, in a college population of 10,000 people, suppose you want to pick a sample of 1000 for a 
survey. For any particular sample of 1000, if you are sampling with replacement, 

• the chance of picking the first person is 1000 out of 10,000 (0.1000); 

• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999); 

• the chance of picking the same person again is 1 out of 10,000 (very low). 

If you are sampling without replacement, 

• the chance of picking the first person for any particular sample is 1000 out of 10,000 (0.1000); 

• the chance of picking a different second person is 999 out of 9,999 (0.0999); 

• you do not replace the first person before picking the next person. 

Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to 4 decimal
places. To 4 decimal places, these numbers are equivalent (0.0999).

Sampling without replacement instead of sampling with replacement only becomes a mathematics issue
when the population is small, which is not that common. For example, if the population is 25 people, the
sample is 10, and you are sampling with replacement, then for any particular sample,

• the chance of picking the first person is 10 out of 25 and a different second person is 9 out of 25 (you 
replace the first person). 

If you sample without replacement, 

• the chance of picking the first person is 10 out of 25 and then the second person (which is different) is 
9 out of 24 (you do not replace the first person). 

Compare the fractions 9/25 and 9/24. To 4 decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To 4 decimal 
places, these numbers are not equivalent. 
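
The two comparisons above (the large population and the small one) can be verified in R with round(). This is a small sketch; the object names are illustrative.

```r
# Large population: 10,000 people, sample of 1000
large.with <- round(999 / 10000, 4)     # with replacement:    0.0999
large.without <- round(999 / 9999, 4)   # without replacement: 0.0999
large.with == large.without             # TRUE: equivalent to 4 decimal places

# Small population: 25 people, sample of 10
small.with <- round(9 / 25, 4)          # with replacement:    0.36
small.without <- round(9 / 24, 4)       # without replacement: 0.375
small.with == small.without             # FALSE: not equivalent
```

R drops trailing zeros when printing, so 0.3600 and 0.3750 appear as 0.36 and 0.375.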

You can also use R to sample with replacement by adding replace=T to the sample() function. Imagine,
for instance, that you want to simulate flipping a coin 30 times. Since there is only one heads and one tails
in our population, we use replacement to get our sample. In the example below, we are again using the
set.seed() function so that you can confirm that you are getting the same results.

> coin.flips = c("H", "T")

> set.seed(123)

> sample(coin.flips, 30, replace=T)

 [1] "H" "T" "H" "T" "T" "H" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H" "H" "H"
[20] "T" "T" "T" "T" "T" "T" "T" "T" "T" "H" "H"
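
Reading 30 letters by eye is error-prone, so it can help to store the flips and let R count them. The sketch below saves the result in an object called flips (a name chosen here for illustration) and summarizes it with length() and table(); the exact counts you get depend on the seed and on your version of R.

```r
coin.flips <- c("H", "T")

set.seed(123)
flips <- sample(coin.flips, 30, replace = TRUE)

length(flips)                  # 30: the number of flips
table(flips)                   # how many heads and how many tails
table(flips) / length(flips)   # the proportion of each outcome
```

The two proportions always add up to 1, and with only 30 flips they will usually not be exactly one half each.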

When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The 
actual process of sampling causes sampling errors. For example, the sample may not be large enough or 
representative of the population. Factors not related to the sampling process cause nonsampling errors. A 
defective counting device can cause a nonsampling error. 

If we were to examine two samples representing the same population, they would, more than likely, not 
be the same. Just as there is variation in data, there is variation in samples. As you become accustomed to 
sampling, the variability will seem natural. 



1.4.1 Optional Collaborative Classroom Exercise 

Exercise 1.1 

As a class, determine whether or not the following samples are representative. If they are not, 
discuss the reasons. 

1. To find the average GPA of all students in a university, use all honor students at the university
as the sample.

2. To find out the most popular cereal among young people under the age of 10, stand outside
a large supermarket for three hours and speak to every 20th child under age 10 who enters
the supermarket.

3. To find the average annual income of all adults in the United States, sample U.S. congressmen.
Create a cluster sample by considering each state as a cluster (group). By using simple
random sampling, select states to be part of the cluster sample. Then survey every U.S. congressman
in the selected states.

4. To determine the proportion of people taking public transportation to work, survey 20 people
in New York City. Conduct the survey by sitting in Central Park on a bench and interviewing
every person who sits next to you.

5. To determine the average cost of a two-day stay in a hospital in Massachusetts, survey 100
hospitals across the state using simple random sampling.



1.5 Variation and Critical Evaluation 5 

1.5.1 Variation in Data 

Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less
than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following
amounts (in ounces) of beverage:

15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5 

Measurements of the amount of beverage in a 16-ounce can may vary because different people make the 
measurements or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers 
regularly run tests to determine if the amount of beverage in a 16-ounce can falls within the desired range. 
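
A quick look at the eight measurements in R shows how much they vary. This is a minimal sketch; the object name cans is chosen here for illustration.

```r
# Amount of beverage (in ounces) measured in eight 16-ounce cans
cans <- c(15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5)

length(cans)     # 8 measurements
range(cans)      # smallest and largest amounts: 14.8 and 16.1
mean(cans)       # average amount: 15.6375

sum(cans < 16)   # 6 cans held less than 16 ounces
sum(cans > 16)   # 1 can held more than 16 ounces
```

Only one of the eight cans holds exactly the stated 16 ounces, which illustrates that some variation is to be expected even in a controlled process.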

Be aware that as you take data, your data may vary somewhat from the data someone else is taking for 
the same purpose. This is completely natural. However, if two or more of you are taking the same data and 
get very different results, it is time for you and the others to reevaluate your data-taking methods and your 
accuracy. 

1.5.2 Variation in Samples 

It was mentioned previously that two or more samples from the same population and having the same 
characteristics as the population may be different from each other. Suppose Doreen and Jung both decide 
to study the average amount of time students sleep each night and use all students at their college as the 
population. Doreen uses systematic sampling and Jung uses cluster sampling. Doreen's sample will be 
different from Jung's sample even though both samples have the characteristics of the population. Even 
if Doreen and Jung used the same sampling method, in all likelihood their samples would be different. 
Neither would be wrong, however. 

Think about what contributes to making Doreen's and Jung's samples different. 



5 This content is available online at <http://cnx.org/content/m35071/1.1/>.






If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample results 
(the average amount of time a student sleeps) would be closer to the actual population average. But still, 
their samples would be, in all likelihood, different from each other. This variability in samples cannot be 
stressed enough. 

1.5.2.1 Size of a Sample 

The size of a sample (often called the number of observations) is important. The examples you have seen 
in this book so far have been small. Samples of only a few hundred observations, or even smaller, are 
sufficient for many purposes. In polling, samples that are from 1200 to 1500 observations are considered 
large enough and good enough if the survey is random and is well done. You will learn why when you 
study confidence intervals. 

1.5.2.2 Optional Collaborative Classroom Exercise 

Exercise 1.2 

Divide into groups of two, three, or four. Your instructor will give each group one 6-sided die. 
Try this experiment twice. Roll one fair die (6-sided) 20 times. Record the number of ones, twos, 
threes, fours, fives, and sixes you get below ("frequency" is the number of times a particular face 
of the die occurs): 

First Experiment (20 rolls)

Face on Die    Frequency
1
2
3
4
5
6

Table 1.2

Second Experiment (20 rolls)

Face on Die    Frequency
1
2
3
4
5
6

Table 1.3



Did the two experiments have the same results? Probably not. If you did the experiment a 
third time, do you expect the results to be identical to the first or second experiment? (Answer yes 
or no.) Why or why not? 






Which experiment had the correct results? They both did. The job of the statistician is to see 
through the variability and draw appropriate conclusions. 



1.5.2.3 Running this experiment in R 

While it is much more interesting to experiment using a real die, we can also simulate it in R. To do this, we
first create an object in R with the numbers 1 through 6 (representing the six faces on a die). After that, we
can use the sample() function to take as many samples as we require, adding, of course, replace = T to
use sampling with replacement.



# Six faces on a die

> die.face = c(1:6)

> set.seed(123)

> aa = sample(die.face, 20, replace = T)

# Create a frequency table to analyze the results

# The first row shows the number on the die face

# The second row shows the frequency at which that number was rolled

> table(aa)
aa
1 2 3 4 5 6
3 3 3 3 2 6

> set.seed(456)

> bb = sample(die.face, 20, replace = T)

> table(bb)
bb
1 2 3 4 5 6
2 6 3 2 5 2

Notice the difference in the results of table aa and table bb. You can expect similar variation in your results 
when manually rolling the die. 
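
The same simulation can be used to see what happens as the number of rolls grows. The sketch below compares 20 rolls with 6000 rolls; the object names and the choice of 6000 are illustrative assumptions, and the exact counts depend on the seed and on your version of R.

```r
die.face <- 1:6

set.seed(123)
small <- sample(die.face, 20, replace = TRUE)     # 20 rolls, as in the exercise
large <- sample(die.face, 6000, replace = TRUE)   # many more rolls

# factor(..., levels = die.face) keeps all six faces in the table,
# even if a face happens never to be rolled
small.freq <- table(factor(small, levels = die.face))
large.freq <- table(factor(large, levels = die.face))

# Relative frequencies: the fraction of rolls showing each face
small.rel <- small.freq / length(small)
large.rel <- large.freq / length(large)

round(small.rel, 3)
round(large.rel, 3)   # each value should be close to 1/6, about 0.167
```

With only 20 rolls the relative frequencies can be far from 1/6, while with thousands of rolls they settle close to it. Two large samples would still differ slightly from each other.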

1.5.3 Critical Evaluation 

We need to critically evaluate the statistical studies we read about and analyze before accepting the results 
of the study. Common problems to be aware of include 

• Problems with Samples: A sample should be representative of the population. A sample that is not
representative of the population is biased. Biased samples give results that are inaccurate and not valid.

• Self-Selected Samples: Responses only by people who choose to respond, such as call-in surveys, are
often unreliable.

• Sample Size Issues: Samples that are too small may be unreliable. Larger samples are better if possible.
In some situations, small samples are unavoidable and can still be used to draw conclusions.
Examples: Crash testing cars, medical testing for rare conditions.

• Undue Influence: Collecting data or asking questions in a way that influences the response.

• Non-response or refusal of subject to participate: The collected responses may no longer be
representative of the population. Often, people with strong positive or negative opinions may answer surveys,
which can affect the results.

• Causality: A relationship between two variables does not mean that one causes the other to occur.
They may both be related (correlated) because of their relationship through a different variable.






• Self-Funded or Self-Interest Studies: A study performed by a person or organization in order to
support their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not
automatically assume that the study is good but do not automatically assume the study is bad either.
Evaluate it on its merits and the work done.

• Misleading Use of Data: Improperly displayed graphs, incomplete data, lack of context.

• Confounding: When the effects of multiple factors on a response cannot be separated. Confounding
makes it difficult or impossible to draw valid conclusions about the effect of each factor.



1.6 Frequency, Relative Frequency, and Cumulative Frequency 6 

Twenty students were asked how many hours they worked per day. Their responses, in hours, are listed 
below, followed by a frequency table listing the different data values in ascending order and their frequen- 
cies. 

5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3 



Frequency Table of Student Work Hours

Data Value   Frequency
2            3
3            5
4            3
5            6
6            2
7            1

Table 1.4

A frequency is the number of times a given datum occurs in a data set. According to the table above, 
there are three students who work 2 hours, five students who work 3 hours, etc. The total of the frequency 
column, 20, represents the total number of students included in the sample. A relative frequency is the 
fraction of times an answer occurs. To find the relative frequencies, divide each frequency by the total 
number of students in the sample - in this case, 20. Relative frequencies can be written as fractions, percents, 
or decimals. Cumulative relative frequency is the accumulation of the previous relative frequencies. To 
find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency 
for the current row. 

Frequency Table of Student Work Hours w/ Relative and Cumulative Relative Frequency

Data Value   Frequency   Relative Frequency   Cumulative Relative Frequency
2            3           3/20 or 0.15         0.15
3            5           5/20 or 0.25         0.15 + 0.25 = 0.40
4            3           3/20 or 0.15         0.40 + 0.15 = 0.55
5            6           6/20 or 0.30         0.55 + 0.30 = 0.85
6            2           2/20 or 0.10         0.85 + 0.10 = 0.95
7            1           1/20 or 0.05         0.95 + 0.05 = 1.00

Table 1.5

6 This content is available online at <http://cnx.org/content/m35072/13/>.

NOTE: 

• The sum of the relative frequency column is 20/20, or 1.

• The last entry of the cumulative relative frequency column is one, indicating that one hun- 
dred percent of the data has been accumulated. 

• Because of rounding, the relative frequency column may not always sum to one and the 
last entry in the cumulative relative frequency column may not be one. However, they each 
should be close to one. 

1.6.1 Using R for Calculating Frequency, Cumulative Frequency, Relative Frequency, 
and Cumulative Relative Frequency 

As you can imagine, it is pretty straightforward to do something like this in R. The following functions
apply:

1. Frequencies: table().

2. Relative frequencies: table() divided by length() (which tells us how many items are in an R object).

3. Cumulative frequencies: First we create intervals using cut(), then we use cumsum(). Note: Set your
cut() breaks ranging from one below your minimum value to your maximum value.

4. Cumulative relative frequencies: First, calculate your cumulative frequencies (see item 3) and divide
that by the total number of observations (obtained using length()).



# Entering the data
> hours.worked = c(5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3)

# A general frequency table
> table(hours.worked)
hours.worked
2 3 4 5 6 7
3 5 3 6 2 1

# Relative frequency table
> table(hours.worked)/length(hours.worked)
hours.worked
   2    3    4    5    6    7
0.15 0.25 0.15 0.30 0.10 0.05

# To get cumulative frequencies, we need to put the hours into different intervals
> x = table(cut(hours.worked, breaks = c(1:7)))

# Cumulative frequencies
> cumsum(x)
(1,2] (2,3] (3,4] (4,5] (5,6] (6,7]
    3     8    11    17    19    20

# Cumulative relative frequencies
> cumsum(x)/length(hours.worked)
(1,2] (2,3] (3,4] (4,5] (5,6] (6,7]
 0.15  0.40  0.55  0.85  0.95  1.00
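
Because the work-hours data are whole numbers, the columns of Table 1.5 can also be assembled in one data frame without first calling cut(). The following is a minimal sketch of that alternative, not the only way to do it; the object names freq, rel.freq, and freq.table are illustrative and are not used elsewhere in this text.

```r
# Work-hours data from the example above
hours.worked = c(5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3)

# Frequency of each distinct value, and the matching relative frequencies
freq = table(hours.worked)
rel.freq = freq / length(hours.worked)

# Assemble all columns side by side (column names are illustrative)
freq.table = data.frame(
  hours = as.numeric(names(freq)),
  frequency = as.vector(freq),
  relative.frequency = as.vector(rel.freq),
  cumulative.frequency = cumsum(as.vector(freq)),
  cumulative.relative.frequency = cumsum(as.vector(rel.freq))
)

print(freq.table)
```

The last row of the cumulative relative frequency column should be 1, which is a quick check that no observation was dropped.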






The following table represents the heights, in inches, of a sample of 100 male semiprofessional soccer play- 
ers. 

Frequency Table of Soccer Player Height

HEIGHTS (INCHES)   FREQUENCY OF STUDENTS   RELATIVE FREQUENCY   CUMULATIVE RELATIVE FREQUENCY
59.95 - 61.95      5                       5/100 = 0.05         0.05
61.95 - 63.95      3                       3/100 = 0.03         0.05 + 0.03 = 0.08
63.95 - 65.95      15                      15/100 = 0.15        0.08 + 0.15 = 0.23
65.95 - 67.95      40                      40/100 = 0.40        0.23 + 0.40 = 0.63
67.95 - 69.95      17                      17/100 = 0.17        0.63 + 0.17 = 0.80
69.95 - 71.95      12                      12/100 = 0.12        0.80 + 0.12 = 0.92
71.95 - 73.95      7                       7/100 = 0.07         0.92 + 0.07 = 0.99
73.95 - 75.95      1                       1/100 = 0.01         0.99 + 0.01 = 1.00
                   Total = 100             Total = 1.00

Table 1.6

The data in this table has been grouped into the following intervals: 

• 59.95 - 61.95 inches

• 61.95 - 63.95 inches

• 63.95 - 65.95 inches

• 65.95 - 67.95 inches

• 67.95 - 69.95 inches

• 69.95 - 71.95 inches

• 71.95 - 73.95 inches

• 73.95 - 75.95 inches



NOTE: This example is used again in the Descriptive Statistics 7 chapter, where the method used to 
compute the intervals will be explained. 

In this sample, there are 5 players whose heights are between 59.95 - 61.95 inches, 3 players whose heights 
fall within the interval 61.95 - 63.95 inches, 15 players whose heights fall within the interval 63.95 - 65.95 
inches, 40 players whose heights fall within the interval 65.95 - 67.95 inches, 17 players whose heights 
fall within the interval 67.95 - 69.95 inches, 12 players whose heights fall within the interval 69.95 - 71.95, 
7 players whose height falls within the interval 71.95 - 73.95, and 1 player whose height falls within the 
interval 73.95 - 75.95. All heights fall between the endpoints of an interval and not at the endpoints. 

Example 1.5 

From the table, find the percentage of heights that are less than 65.95 inches. 

Solution 

If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are
5 + 3 + 15 = 23 males whose heights are less than 65.95 inches. The percentage of heights less than
65.95 inches is then 23/100 or 23%. This percentage is the cumulative relative frequency entry in the
third row.

7 "Descriptive Statistics: Introduction" <http://cnx.org/content/m16300/latest/>



Example 1.6 

From the table, find the percentage of heights that fall between 61.95 and 65.95 inches. 

Solution 

Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%. 
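
The arithmetic in Examples 1.5 and 1.6 can be checked in R by entering the frequency column of the height table directly. This is only a sketch; the object names height.freq, rel.freq, and cum.rel.freq are illustrative.

```r
# Frequencies from the soccer player height table, one entry per interval
height.freq = c(5, 3, 15, 40, 17, 12, 7, 1)
names(height.freq) = c("59.95-61.95", "61.95-63.95", "63.95-65.95",
                       "65.95-67.95", "67.95-69.95", "69.95-71.95",
                       "71.95-73.95", "73.95-75.95")

# Relative and cumulative relative frequencies
rel.freq = height.freq / sum(height.freq)
cum.rel.freq = cumsum(rel.freq)

# Example 1.5: proportion of heights below 65.95 inches (third row, cumulative)
below.65.95 = cum.rel.freq[3]

# Example 1.6: proportion of heights between 61.95 and 65.95 inches (rows 2 and 3)
between.61.95.and.65.95 = sum(rel.freq[2:3])

print(below.65.95)
print(between.61.95.and.65.95)
```

Multiplying either result by 100 gives the percentage quoted in the solutions.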



Example 1.7 

Use the table of heights of the 100 male semiprofessional soccer players. Fill in the blanks and 
check your answers. 

1. The percentage of heights that are from 67.95 to 71.95 inches is: 

2. The percentage of heights that are from 67.95 to 73.95 inches is: 

3. The percentage of heights that are more than 65.95 inches is: 

4. The number of players in the sample who are between 61.95 and 71.95 inches tall is: 

5. What kind of data are the heights? 

6. Describe how you could gather this data (the heights) so that the data are characteristic of all 
male semiprofessional soccer players. 

Remember, you count frequencies. To find the relative frequency, divide the frequency by the 
total number of data values. To find the cumulative relative frequency, add all of the previous 
relative frequencies to the relative frequency for the current row. 



1.7 Practice 8 

1.7.1 Student Learning Outcomes 

• The student will practice constructing frequency tables. 

• The student will compare sampling techniques. 



1.7.2 Given 

Studies are often done by pharmaceutical companies to determine the effectiveness of a treatment program. 
Suppose that a new AIDS antibody drug is currently under study. It is given to patients once the AIDS 
symptoms have revealed themselves. Of interest is the average length of time in months patients live once 
starting the treatment. Two researchers each follow a different set of 40 AIDS patients from the start of 
treatment until their deaths. The following data (in months) are collected. 

Researcher 1: 3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25, 15, 26, 27, 33, 29, 35, 44, 13, 21, 22, 10, 12, 8, 40, 
32, 26, 27, 31, 34, 29, 17, 8, 24, 18, 47, 33, 34 

Researcher 2: 3, 14, 11, 5, 16, 17, 28, 41, 31, 18, 14, 14, 26, 25, 21, 22, 31, 2, 35, 44, 23, 21, 21, 16, 12, 18, 41, 
22, 16, 25, 33, 34, 29, 13, 18, 24, 23, 42, 33, 29 



8 This content is available online at <http://cnx.org/content/m35074/1.2/>. 






1.7.3 Organize the Data 

Complete the tables below using the data provided. 

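Researcher 1

Survival Length (in months)   Frequency   Relative Frequency   Cumulative Rel. Frequency
0.5 - 6.5                     ______      ______               ______
6.5 - 12.5                    ______      ______               ______
12.5 - 18.5                   ______      ______               ______
18.5 - 24.5                   ______      ______               ______
24.5 - 30.5                   ______      ______               ______
30.5 - 36.5                   ______      ______               ______
36.5 - 42.5                   ______      ______               ______
42.5 - 48.5                   ______      ______               ______

Table 1.7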
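Researcher 2

Survival Length (in months)   Frequency   Relative Frequency   Cumulative Rel. Frequency
0.5 - 6.5                     ______      ______               ______
6.5 - 12.5                    ______      ______               ______
12.5 - 18.5                   ______      ______               ______
18.5 - 24.5                   ______      ______               ______
24.5 - 30.5                   ______      ______               ______
30.5 - 36.5                   ______      ______               ______
36.5 - 42.5                   ______      ______               ______
42.5 - 48.5                   ______      ______               ______

Table 1.8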
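
After completing the tables by hand, you may want to check your work in R. The sketch below follows the cut() and cumsum() approach from the previous section. The function frequency.table() is a hypothetical helper written for this illustration (it is not built into R), and only Researcher 1's data are shown; the same call works for Researcher 2.

```r
# Hypothetical helper: grouped frequency table for the given interval boundaries
frequency.table = function(x, breaks) {
  intervals = cut(x, breaks = breaks)   # assign each value to an interval
  freq = table(intervals)               # count the values in each interval
  data.frame(
    interval = names(freq),
    frequency = as.vector(freq),
    relative.frequency = as.vector(freq) / length(x),
    cumulative.relative.frequency = cumsum(as.vector(freq)) / length(x)
  )
}

researcher.1 = c(3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25, 15, 26,
                 27, 33, 29, 35, 44, 13, 21, 22, 10, 12, 8, 40, 32, 26, 27,
                 31, 34, 29, 17, 8, 24, 18, 47, 33, 34)

# Boundaries 0.5, 6.5, ..., 48.5, matching the rows of Table 1.7
survival.breaks = seq(from = 0.5, to = 48.5, by = 6)

table.1 = frequency.table(researcher.1, survival.breaks)
print(table.1)
```

R labels each interval in the form (0.5,6.5], meaning greater than 0.5 and up to and including 6.5. Since no survival time falls exactly on a boundary, these labels correspond to the rows of the tables above.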



1.7.4 Discussion Questions 

Discuss the following questions and then answer in complete sentences. 

Exercise 1.3 

List two reasons why the data may differ. Would you expect the data to be identical? Why or why 
not? 

Exercise 1.4 

Can you tell if one researcher is correct and the other one is incorrect? Why? 

Exercise 1.5 

How could the researchers gather random data? 






Exercise 1.6 

Suppose that the first researcher conducted his survey by randomly choosing one state in the 
nation and then randomly picking 40 patients from that state. What sampling method would that 
researcher have used? 

Exercise 1.7 

Suppose that the second researcher conducted his survey by choosing 40 patients he knew. What 
sampling method would that researcher have used? What concerns would you have about this 
data set, based upon the data collection method? 




Solutions to Exercises in Chapter 1 

Solution to Example 1.7, Problem (p. 15) 

1. 29% 

2. 36% 

3. 77% 

4. 87 

5. quantitative continuous 

6. get rosters from each team and choose a simple random sample from each 



Chapter 2 

Descriptive Statistics 



2.1 Introduction 1 

2.1.1 Student Learning Objectives 

By the end of this chapter, the student should be able to: 

• Display data graphically and interpret graphs: stemplots, histograms and boxplots. 

• Recognize, describe, and calculate the measures of location of data: quartiles and percentiles. 

• Recognize, describe, and calculate the measures of the center of data: mean, median, and mode. 

• Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, 
and range. 

• Use R to generate basic graphs that help to interpret data. 

The main R functions that will be used in this chapter include barplot(), boxplot(), hist(), seq(), and
stem().

2.1.2 Introduction 

Once you have collected data, what will you do with it? Data can be described and presented in many 
different formats. For example, suppose you are interested in buying a house in a particular area. You may 
have no clue about the house prices, so you might ask your real estate agent to give you a sample data set 
of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look 
at the median price and the variation of prices. The median and variation are just two ways that you will 
learn to describe data. Your agent might also provide you with a graph of the data. 

In this chapter, you will study numerical and graphical ways to describe and display your data. This 
area of statistics is called "Descriptive Statistics". You will learn to calculate, and even more importantly, 
to interpret these measurements and graphs. 

2.2 Stem and Leaf Graphs and Bar Graphs 2 

One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It
is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem
and a leaf. The leaf consists of one digit. For example, 23 has stem 2 and leaf 3. Four hundred thirty-two
(432) has stem 43 and leaf 2. Five thousand four hundred thirty-two (5,432) has stem 543 and leaf 2. The
decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from smallest to largest. Draw a vertical
line to the right of the stems. Then write the leaves in increasing order next to their corresponding stem.

1 This content is available online at <http://cnx.org/content/m35079/1.1/>.
2 This content is available online at <http://cnx.org/content/m35075/1.2/>.

Example 2.1 

For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to 
largest): 

33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 
94, 96, 100 

Stem-and-Leaf Diagram

Stem   Leaf
3      3
4      299
5      355
6      1378899
7      2348
8      03888
9      0244446
10     0

Table 2.1

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores,
or approximately 26%, were in the 90s or 100, a fairly high number of As.

The stemplot is a quick way to graph and gives an exact picture of the data. You want to look for an overall 
pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is 
sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the 
graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may 
indicate that something unusual is happening. It takes some background information to explain outliers. 
In the example above, there were no outliers. 

Example 2.2 

Create a stem plot using the data: 

1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5, 4.5, 4.7, 4.8, 5.5, 5.6, 6.5, 6.7, 12.3 
The data are the distance (in kilometers) from a home to the nearest supermarket. 

Problem (Solution on p. 32.) 

1. Are there any outliers? 

2. Do the data seem to have any concentration of values? 



HINT: The leaves are to the right of the decimal. 




2.2.1 Stem and Leaf Plots in R 

The stem() function in R is used to create stem and leaf plots. For most purposes, you do not need to pass
any other arguments; however, you may occasionally need to use the scale= argument to get a more
usable stem and leaf plot. Notice that R tells you where the decimal is in its output so that you can easily
figure out the original values.

> exam.scores = c(33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69,
+     72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100)
> stem(exam.scores)

  The decimal point is 1 digit(s) to the right of the |

   3 | 3
   4 | 299
   5 | 355
   6 | 1378899
   7 | 2348
   8 | 03888
   9 | 0244446
  10 | 0
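
The scale= argument mentioned above controls how long the plot is, which can change how values are grouped into stems. As a rough sketch, the supermarket distances from Example 2.2 can be plotted with the default setting and again with scale = 2 so the two outputs can be compared. The object name distance is illustrative.

```r
# Distances (in kilometers) from Example 2.2
distance = c(1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0,
             4.2, 4.5, 4.5, 4.7, 4.8, 5.5, 5.6, 6.5, 6.7, 12.3)

# Default plot: R chooses the stems automatically
stem(distance)

# scale = 2 makes the plot about twice as long, giving finer stems
stem(distance, scale = 2)
```

Read the line about the decimal point in each output before interpreting the leaves, since the position of the decimal can differ between the two plots.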









Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be
rectangular boxes, and they can be vertical or horizontal. The bar graph shown in Example 2.4 uses the data
of Example 2.3. Frequencies are represented by the heights of the bars.

Example 2.3 

In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do 
his/her chores. The results are shown in the table and the bar graph. 



Number of times teenager is reminded   Frequency
0                                      2
1                                      5
2                                      8
3                                      14
4                                      7
5                                      4

Table 2.2



Example 2.4 



[Figure: bar graph with Frequency on the y-axis and Number of Times Teenager is Reminded on the x-axis]

Figure 2.1



2.2.2 Bar Graphs in R 

To create a bar graph in R, you need to use the barplot() function. There are several useful arguments for
this function including:

• names.arg : An object representing the values to be placed below each bar (usually the x-axis).

• xlab : The title for your x-axis.

• ylab : The title for your y-axis.

• main : The title of your graph.

> reminders = c(0, 1, 2, 3, 4, 5)
> frequency = c(2, 5, 8, 14, 7, 4)
> barplot(frequency, names.arg = reminders, xlab = "Number of reminders",
+     ylab = "Frequency of responses",
+     main = "Number of times each week
+ that a teenager is reminded to do chores")



[Figure: bar graph titled "Number of times each week that a teenager is reminded to do chores", with Number of reminders on the x-axis]

Figure 2.2



The bar graph shown in Example 2.5 has age groups represented on the x-axis and proportions on the
y-axis.

Example 2.5 

By the end of March 2009, in the United States Facebook had over 56 million users. The table
shows the age groups, the number of users in each age group and the proportion (%) of users
in each age group. Source:
http://www.insidefacebook.com/2009/03/25/number-of-us-facebook-users-over-35-nearly-doubles-in-last-60-days/



Age groups   Number of Facebook users   Proportion (%) of Facebook users
13-25        25,510,040                 46%
26-44        23,123,900                 41%
45-65        7,431,020                  13%

Table 2.3







Figure 2.3 
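
A bar graph like the one described in Example 2.5 can be drawn with barplot(). The sketch below computes the proportions from the user counts in Table 2.3 rather than typing them in; the object names and the axis and title text are illustrative choices, not taken from the original figure.

```r
# Age groups and user counts from Table 2.3
age.groups = c("13-25", "26-44", "45-65")
users = c(25510040, 23123900, 7431020)

# Proportion of users in each group, as a whole-number percentage
proportion = round(100 * users / sum(users))

# Bar graph with age groups on the x-axis and proportions on the y-axis
barplot(proportion, names.arg = age.groups,
        xlab = "Age groups",
        ylab = "Proportion (%) of Facebook users",
        main = "Facebook users in the United States by age group")
```

Computing the proportions from the counts also confirms the percentages listed in the table.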



2.3 Histograms 3 

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a 
histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data 
set consists of 100 values or more. 

A histogram consists of contiguous boxes. It has both a horizontal axis and a vertical axis. The hori- 
zontal axis is labeled with what the data represents (for instance, distance from your home to school). The 
vertical axis is labeled either "frequency" or "relative frequency". The graph will have the same shape with 
either label. Frequency is commonly used when the data set is small and relative frequency is used when 
the data set is large or when we want to compare several distributions. The histogram (like the stemplot) 
can give you the shape of the data, the center, and the spread of the data. 

The relative frequency is equal to the frequency for an observed value of the data divided by the total 
number of data values in the sample. (In the chapter on Sampling and Data 4 , we defined frequency as the 
number of times an answer occurs.) If: 



• f = frequency

• n = total number of data values (or the sum of the individual frequencies), and

• RF = relative frequency,

then:

$RF = \frac{f}{n}$ (2.1)

For example, if 3 students in Mr. Ahab's English class of 40 students received an A, then,

$f = 3$, $n = 40$, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$

3 This content is available online at <http://cnx.org/content/m35077/1.1/>.
4 "Sampling and Data: Introduction" <http://cnx.org/content/m16008/latest/>

Seven and a half percent of the students received an A. 

To construct a histogram, first decide how many bars or intervals, also called classes, represent the
data. Many histograms consist of 5 to 15 bars or classes for clarity. Choose a starting point for the first
interval to be less than the smallest data value. A convenient starting point is a lower value carried out
to one more decimal place than the value with the most decimal places. For example, if the value with the
most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 0.05 = 6.05).
We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value
is 1.5, a convenient starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most decimal places is
3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data
happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 - 0.5 = 1.5). Also,
when the starting point and other boundaries are carried to one additional decimal place, no data value
will fall on a boundary.
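
The rule above amounts to subtracting half a unit in the last decimal place used by the data. As a minimal sketch, it can be written as a short R function. The name starting.point() and its arguments are hypothetical, and you must supply the number of decimal places yourself.

```r
# Hypothetical helper: convenient starting point for the first interval.
#   smallest : the smallest data value
#   decimals : the largest number of decimal places found in the data
starting.point = function(smallest, decimals) {
  smallest - 0.5 * 10^(-decimals)
}

# The four cases described in the paragraph above
starting.point(6.1, 1)   # data recorded to one decimal place
starting.point(1.5, 2)   # data recorded to two decimal places
starting.point(1.0, 3)   # data recorded to three decimal places
starting.point(2, 0)     # integer data
```

Each call should reproduce the corresponding starting point worked out in the text.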

Example 2.6 

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional 
soccer players. The heights are continuous data since height is measured. 

60, 60.5, 61, 61, 61.5, 63.5, 63.5, 63.5, 64, 64, 64, 64, 64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 
64.5, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 
67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 68, 68, 69, 69, 69, 69, 69, 
69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5, 69.5, 70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71, 72, 72, 72, 
72.5, 72.5, 73, 73.5, 74 

The smallest data value is 60. Since the data with the most decimal places has one decimal 
(for instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 
0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for 
the convenient starting point. 

60 - 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point
is, then, 59.95. The largest value is 74. 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the
starting point from the ending value and divide by the number of bars (you must choose the
number of bars you desire). Suppose you choose 8 bars.

$\frac{74.05 - 59.95}{8} \approx 1.76$ (2.2)



NOTE: We will round up to 2 and make each bar or class interval 2 units wide. Rounding up to 2 is 
one way to prevent a value from falling on a boundary. For this example, using 1.76 as the width 
would also work. 

The boundaries are: 

59.95 

59.95 + 2 = 61.95 
61.95 + 2 = 63.95 
63.95 + 2 = 65.95 
65.95 + 2 = 67.95 
67.95 + 2 = 69.95 
69.95 + 2 = 71.95 
71.95 + 2 = 73.95 
73.95 + 2 = 75.95 
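
The width calculation and the list of boundaries can be reproduced in R with a few lines. This is a sketch using illustrative object names; seq() is the same function used later in this section to build the histogram breaks.

```r
# Starting point, ending value, and chosen number of bars from Example 2.6
start.point = 60 - 0.05
end.point = 74 + 0.05
bars = 8

# Raw width of each interval, then rounded up to a whole number
width = (end.point - start.point) / bars
width.rounded = ceiling(width)

# One more boundary than there are bars
boundaries = seq(from = start.point, by = width.rounded, length.out = bars + 1)

print(width)
print(boundaries)
```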

The heights 60 through 61.5 inches are in the interval 59.95 - 61.95. The heights that are 63.5 are
in the interval 61.95 - 63.95. The heights that are 64 through 64.5 are in the interval 63.95 - 65.95.
The heights 66 through 67.5 are in the interval 65.95 - 67.95. The heights 68 through 69.5 are in the
interval 67.95 - 69.95. The heights 70 through 71 are in the interval 69.95 - 71.95. The heights 72
through 73.5 are in the interval 71.95 - 73.95. The height 74 is in the interval 73.95 - 75.95.

The following histogram displays the heights on the x-axis and relative frequency on the y-axis.

[Figure: histogram of Heights (x-axis, class boundaries 59.95 to 75.95) against Relative Frequency (y-axis); bar heights 0.05, 0.03, 0.15, 0.4, 0.17, 0.12, 0.07, 0.01]

Figure 2.4



2.3.1 Creating a Histogram in R 

Because histograms are so often used in statistical analysis, as you can imagine, R is able to generate
histograms quite easily. The functions you will use are seq() to generate the required intervals and hist() for
generating the histogram. Additionally, you may use the following arguments with the hist() function:

• breaks : Used to tell R how many breaks the histogram should have or where the intervals should be.

• xlab : This will add a label to your x-axis.

• ylab : This will add a label to your y-axis.

• main : Used to add a chart title.

> player.height = c(60, 60.5, 61, 61, 61.5, 63.5, 63.5, 63.5, 64, 64, 64, 64,
+     64, 64, 64, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5,
+     66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66.5, 66.5, 66.5,
+     66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67, 67,
+     67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5,
+     67.5, 67.5, 67.5, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69,
+     69, 69.5, 69.5, 69.5, 69.5, 69.5, 70, 70, 70, 70, 70, 70,
+     70.5, 70.5, 70.5, 71, 71, 71, 72, 72, 72, 72.5, 72.5, 73,
+     73.5, 74)
> hist.breaks = seq(from = 59.95, by = 2, length = 9)
> hist(player.height, breaks = hist.breaks, xlab = "Player Heights",
+     ylab = "Frequency", main = "Heights (in inches) of 100 Soccer Players")



[Figure: histogram titled "Heights (in inches) of 100 Soccer Players", with Player Heights on the x-axis]

Figure 2.5
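
The hist() function also returns the counts it plotted, which makes it possible to check the histogram against the frequency table from the Sampling and Data chapter. The sketch below rebuilds the height data compactly with rep() (an equivalent way of entering the same 100 values) and asks hist() for the counts without drawing the plot. The object names h and rel.freq are illustrative.

```r
# The same 100 heights as above, entered compactly with rep()
player.height = c(60, 60.5, 61, 61, 61.5, rep(63.5, 3), rep(64, 7),
                  rep(64.5, 8), rep(66, 10), rep(66.5, 11), rep(67, 12),
                  rep(67.5, 7), rep(68, 2), rep(69, 10), rep(69.5, 5),
                  rep(70, 6), rep(70.5, 3), rep(71, 3), rep(72, 3),
                  rep(72.5, 2), 73, 73.5, 74)

hist.breaks = seq(from = 59.95, by = 2, length.out = 9)

# plot = FALSE returns the histogram object without drawing it
h = hist(player.height, breaks = hist.breaks, plot = FALSE)

# Frequencies and relative frequencies for each interval
print(h$counts)
rel.freq = h$counts / sum(h$counts)
print(rel.freq)
```

The counts should agree with the frequency column of the soccer player height table, and the relative frequencies with the bar heights in Figure 2.4.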



Example 2.7 

The following data are the number of books bought by 50 part-time college students at ABC 
College. The number of books is discrete data since books are counted. 

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 
4, 5, 5, 5, 5, 5, 6, 6 

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students buy 3 books. Six 
students buy 4 books. Five students buy 5 books. Two students buy 6 books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the 
largest data value. Then the starting point is 0.5 and the ending value is 6.5. 

Problem (Solution on p. 32.) 

Calculate the width of each bar or class interval. If the data are discrete and there are not too many 
different values, a width that places the data values in the middle of the bar or class interval is the 
most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 and the starting point is 0.5, 
a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the 
interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the
interval from _______ to _______, the 5 in the middle of the interval from _______ to _______, and
the _______ in the middle of the interval from _______ to _______.






Calculate the number of bars as follows:

$\frac{6.5 - 0.5}{\text{bars}} = 1$ (2.3)

where 1 is the width of a bar. Therefore, bars = 6.

The following histogram displays the number of books on the x-axis and the frequency on the 
y-axis. 



[Figure: histogram with Number of Books on the x-axis (class boundaries 0.5 to 6.5) and frequency on the y-axis]

Figure 2.6
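
A histogram like Figure 2.6 can be produced in R with the same approach used for the soccer player heights. In this sketch the 50 values are entered with rep(), using the counts stated in the example. The object names and label text are illustrative.

```r
# Number of books bought by 50 students (counts taken from Example 2.7)
books = c(rep(1, 11), rep(2, 10), rep(3, 16), rep(4, 6), rep(5, 5), rep(6, 2))

# Boundaries 0.5, 1.5, ..., 6.5 place each whole number in the middle of a bar
book.breaks = seq(from = 0.5, to = 6.5, by = 1)

h = hist(books, breaks = book.breaks, xlab = "Number of Books",
         ylab = "Frequency", main = "Books Bought by 50 Part-Time Students")

# The frequencies that were plotted, one per bar
print(h$counts)
```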



2.4 Box Plots 5 



Box plots or box-whisker plots give a good graphical image of the concentration of the data. They also 
show how far from most of the data the extreme values are. The box plot is constructed from five values: 
the smallest value, the first quartile, the median, the third quartile, and the largest value. The median, the 
first quartile, and the third quartile will be discussed here, and then again in the section on measuring data 
in this chapter. We use these values to compare how close other data values are to them. 

The median, a number, is a way of measuring the "center" of the data. You can think of the median as 
the "middle value," although it does not actually have to be one of the observed values. It is a number that 
separates ordered data into halves. Half the values are the same number or smaller than the median and 
half the values are the same number or larger. For example, consider the following data: 

1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1 

Ordered from smallest to largest: 

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 



5 This content is available online at <http://cnx.org/content/m35078/1.1/>.

The median is between the 7th value, 6.8, and the 8th value, 7.2. To find the median, add the two values
together and divide by 2.

$\frac{6.8 + 7.2}{2} = 7$ (2.4)

The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7.

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the 
data. To find the quartiles, first find the median or second quartile. The first quartile is the middle value of 
the lower half of the data and the third quartile is the middle value of the upper half of the data. To get the 
idea, consider the same data set shown above. 

The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of 
the lower half is 2. 

1, 1, 2, 2, 4, 6, 6.8 

The number 2, which is part of the data, is the first quartile. One-fourth of the values are the same or 
less than 2 and three-fourths of the values are more than 2. 

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is 9. 

7.2,8,8.3,9,10,10,11.5 

The number 9, which is part of the data, is the third quartile. Three-fourths of the values are less than 9 
and one-fourth of the values are more than 9. 
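
These values can be checked in R. The sketch below uses median() and fivenum(); for this data set fivenum() follows the same "middle value of each half" approach described above. Be aware that R's quantile() function uses a different interpolation rule by default, so it may report slightly different quartiles for the same data. The object name a matches the one used later in this section; five is illustrative.

```r
# Data from the start of this section
a = c(1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1)

# Median: the average of the 7th and 8th ordered values
print(median(a))

# Five-number summary: smallest value, first quartile, median,
# third quartile, largest value
five = fivenum(a)
print(five)
```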

To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest 
data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile 
marks the other end of the box. The middle fifty percent of the data fall inside the box. The "whiskers" 
extend from the ends of the box to the smallest and largest data values. The box plot gives a good quick 
picture of the data. 

Using the data from the start of this section, recall that the first quartile is 2, the median is 7, and the third 
quartile is 9. The smallest value is 1 and the largest value is 11.5. The box plot is constructed as follows: 



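[Figure: horizontal box plot on a number line from 1 to 11.5, with the box from 2 to 9 and the median marked at 7]

Figure 2.7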



The two whiskers extend from the first quartile to the smallest value and from the third quartile to the 
largest value. The median is shown with a dashed line. 

2.4.1 Making a Box Plot in R 

As with histograms, box plots are very easy to make in R. They are made using the boxplot() function. By
default, R orients the box plot vertically; to change this, simply add the argument horizontal = T to the
function. For most simple plots, no additional arguments are required. You may also want to change the
size of the R graph window to view the box plot in a more aesthetic scale.

Here is how you would create the box plot in R using the same data from the start of this section. 



> a = c(1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1)
> boxplot(a, horizontal = T)



[Figure: horizontal box plot of a, as drawn by boxplot()]

Figure 2.8



Example 2.8 

The following data are the heights of 40 students in a statistics class. 

59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 
70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77 

Construct a box plot with the following properties: 

• Smallest value = 59 

• Largest value = 77 

• Q1: First quartile = 64.5

• Q2: Second quartile or median= 66 

• Q3: Third quartile = 70 



[Figure: box plot with smallest value 59, first quartile 64.5, median 66, third quartile 70, and largest value 77]

Figure 2.9



a. Each quarter has 25% of the data.

b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66 - 64.5 = 1.5 (second quarter),
70 - 66 = 4 (third quarter), and 77 - 70 = 7 (fourth quarter). So, the second quarter has the
smallest spread and the fourth quarter has the largest spread.

c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5.

d. The interval 59 through 65 has more than 25% of the data so it has more data in it than the
interval 66 through 70 which has 25% of the data.
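
The quantities in parts b and c can be verified in R. This sketch uses fivenum(), which for these data gives the same quartiles as the method described in this section; the built-in IQR() function relies on a different quartile rule and may return a slightly different value. The object names are illustrative.

```r
# Heights of 40 students from Example 2.8
heights = c(59, 60, 61, 62, 62, 63, 63, 64, 64, 64, rep(65, 9), 66, 66,
            67, 67, 68, 68, 69, rep(70, 5), 71, 71, 72, 72, 73, 74, 74,
            75, 77)

# Smallest value, Q1, median, Q3, largest value
five = fivenum(heights)
print(five)

# Spread of each quarter: differences between consecutive summary values
quarter.spreads = diff(five)
print(quarter.spreads)

# Interquartile range computed as Q3 - Q1
iqr = five[4] - five[2]
print(iqr)

boxplot(heights, horizontal = TRUE)
```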

For some sets of data, some of the largest value, smallest value, first quartile, median, and third 
quartile may be the same. For instance, you might have a data set in which the median and the 
third quartile are the same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the third quartile and the 
median. For example, if the smallest value and the first quartile were both 1, the median and the 
third quartile were both 5, and the largest value was 7, the box plot would look as follows: 



Figure 2.10 



Example 2.9 

Test scores for a college statistics class held during the day are: 

99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90 

Test scores for a college statistics class held during the evening are: 

98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5 

Problem (Solution on p. 32.) 

• What are the smallest and largest data values for each data set? 

• What are the median, the first quartile, and the third quartile for each data set?

• Create a boxplot for each set of data. 

• Which boxplot has the widest spread for the middle 50% of the data (the data between the 
first and third quartiles)? What does this mean for that set of data in comparison to the other 
set of data? 

• For each data set, what percent of the data is between the smallest value and the first
quartile? (Answer: 25%) Between the first quartile and the median? (Answer: 25%) Between the
median and the third quartile? Between the third quartile and the largest value? What percent
of the data is between the first quartile and the largest value? (Answer: 75%)

The first data set (the top box plot) has the widest spread for the middle 50% of the data. The
IQR (Q3 - Q1) is 82.5 - 56 = 26.5 for the first data set and 89 - 78 = 11 for the second data set.
So, the first set of data has its middle 50% of scores more spread out.

25% of the data is between M and Q3 and 25% is between Q3 and Xmax. 
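
The two interquartile ranges quoted above can be checked in R. The sketch below relies on fivenum(), whose second and fourth values (the hinges) agree with the hand-computed quartiles for these two data sets. The names day, evening, and iqr_hinges are illustrative and are not part of the original text.

```r
# Test scores from Example 2.9
day <- c(99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
         45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90)
evening <- c(98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
             90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5)

# Hypothetical helper: IQR as upper hinge minus lower hinge
iqr_hinges <- function(x) {
  five <- fivenum(x)
  five[4] - five[2]
}

iqr_day <- iqr_hinges(day)
iqr_evening <- iqr_hinges(evening)
print(c(day = iqr_day, evening = iqr_evening))
```
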






Solutions to Exercises in Chapter 2 

Solution to Example 2.2, Problem (p. 20) 

The value 12.3 may be an outlier. Values appear to concentrate at 3 and 4 miles. 



Solution to Example 2.7, Problem (p. 27) 

• 3.5 to 4.5 

• 4.5 to 5.5 

• 6 

• 5.5 to 6.5 

Solution to Example 2.9, Problem (p. 31) 
First Data Set 

• Xmin = 32 

• Q1 = 56

• M = 74.5 

• Q3 = 82.5 

• Xmax = 99 



Stem | Leaf
1    | 1 5
2    | 3 5 7
3    | 3 3 3 5 8
4    | 0 2 5 5 7 8
5    | 5 6 6
6    | 5 7
7    |
8    |
9    |
10   |
11   |
12   | 3

Table 2.4



Second Data Set 

• Xmin = 25.5 

• Q1 = 78

• M = 81 

• Q3 = 89 

• Xmax = 98 






Figure 2.11: Box plots of the first and second data sets on a common axis running from 20 to 100.
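
One way to obtain these values and a comparable pair of box plots in R is sketched below. It assumes the two score lists from Example 2.9 are stored in vectors named day and evening (names chosen here for illustration). The range = 0 argument makes the whiskers reach the smallest and largest values; without it, R would draw the lowest evening scores as separate outlier points.

```r
# Test scores from Example 2.9
day <- c(99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
         45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90)
evening <- c(98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
             90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5)

# Five-number summaries, one row per data set
summaries <- rbind(first = fivenum(day), second = fivenum(evening))
colnames(summaries) <- c("Xmin", "Q1", "M", "Q3", "Xmax")
print(summaries)

# R draws the first group at the bottom of a horizontal plot,
# so the evening class is listed first to put the day class on top
boxplot(list(Evening = evening, Day = day),
        horizontal = TRUE, range = 0)
```
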






Glossary 



A Average 

A number that describes the central tendency of the data. There are a number of specialized 
averages, including the arithmetic mean, weighted mean, median, mode, and geometric mean. 

C Continuous Random Variable 

A random variable (RV) whose outcomes are measured. 

Example: The height of trees in the forest is a continuous RV. 

Cumulative Relative Frequency 

The term applies to an ordered set of observations from smallest to largest. The Cumulative 
Relative Frequency is the sum of the relative frequencies for all values that are less than or equal 
to the given value. 

D Data

A set of observations (a set of possible outcomes). Most data can be put into two groups:
qualitative (hair color, ethnic groups, and other attributes of the population) and quantitative
(distance traveled to college, number of children in a family, etc.). Quantitative data can be
separated into two subgroups: discrete and continuous. Data is discrete if it is the result of
counting (the number of students of a given ethnic group in a class, the number of books on a
shelf, etc.). Data is continuous if it is the result of measuring (distance traveled, weight of
luggage, etc.).

Discrete Random Variable 

A random variable (RV) whose outcomes are counted. 



F Frequency 

The number of times a value of the data occurs. 



M Median 



A number that separates ordered data into halves: half the values are the same number or 
smaller than the median and half the values are the same number or larger than the median. 
The median may or may not be part of the data. 



O Outlier 




An observation that does not fit the rest of the data. 

P Population 

The collection, or set, of all individuals, objects, or measurements whose properties are being 
studied. 

Proportion 

• As a number: A proportion is the number of successes divided by the total number in the 
sample. 

• As a probability distribution: Given a binomial random variable (RV), X ~ B(n, p), consider
the ratio of the number X of successes in n Bernoulli trials to the number n of trials: P' = X/n.
This new RV is called a proportion, and if the number of trials, n, is large enough,
P' ~ N(p, pq/n), where q = 1 - p.

Q Qualitative Data 

See Data. 

Quantitative Data

See Data.

Quartiles 

The numbers that separate the data into quarters. Quartiles may or may not be part of the data. 
The second quartile is the median of the data. 

R Relative Frequency 

The ratio of the number of times a value of the data occurs in the set of all outcomes to the 
number of all outcomes. 

S Sample 

A portion of the population under study. A sample is representative if it characterizes the
population being studied.






Index of Keywords and Terms 

Keywords are listed by the section with that keyword (page numbers are in parentheses). Keywords
do not necessarily appear in the text of the page. They are merely associated with that section.
Ex. apples, § 1.1 (1)
Terms are referenced by the page they appear on. Ex. apples, 1



A average, § 1.2(3), 4

B bar, § 2.3(24)
box, § 2.4(28) 
boxes, § 2.3(24) 

C categorical, § 1.2(3) 
cluster sample, § 1.4(6) 
collaborative, § (1) 
Continuous, § 1.3(5), 5 
Convenience sampling, § 1.4(6) 
Counting, § 1.3(5) 
cumulative, § 1.6(12) 
Cumulative relative frequency, 12 

D Data, § 1.1(3), § 1.2(3), 4, § 1.3(5), § 1.5(9),
§ 1.7(15), § 2.1(19), § 2.3(24)
descriptive, § 2.2(19)
Discrete, § 1.3(5), 5

E elementary, § (1), § 2.1(19), § 2.2(19) 

F frequency, § 1.6(12), 12, § 1.7(15), 24 

G graph, § 2.2(19) 

H histogram, § 2.3(24) 

I Introduction, § 1.1(3) 

L leaf, § 2.2(19) 

M measurement, § 1.5(9) 
Measuring, § 1.3(5) 
median, § 2.1(19), § 2.4(28), 28 

N nonsampling errors, § 1.4(6) 
numerical, § 1.2(3) 



O outlier, 20 

P parameter, § 1.2(3), 4
population, § 1.2(3), 3, 9
practice, § 1.7(15)
proportion, § 1.2(3), 4 

Q Qualitative, § 1.3(5) 
Quantitative, § 1.3(5)
quartiles, § 2.4(28), 29 

R random sampling, § 1.4(6) 
relative, § 1.6(12)
relative frequency, 12, 24 
representative, § 1.2(3) 

S sample, § 1.2(3), 3, § 1.4(6), § 1.5(9) 
samples, 9 

Sampling, § 1.1(3), § 1.2(3), 4, § 1.4(6), § 1.5(9),
§ 1.7(15)

sampling errors, § 1.4(6)
simple random sampling, § 1.4(6)
size, § 1.5(9)
statistic, § 1.2(3), 4

statistics, § (1), § 1.1(3), § 1.3(5), § 1.4(6), 
§ 1.5(9), § 1.6(12), § 1.7(15), § 2.1(19), § 2.2(19) 
stem, § 2.2(19) 
stemplot, § 2.2(19) 
stratified sample, § 1.4(6) 
systematic sample, § 1.4(6) 

V variability, § 1.5(9) 
variable, § 1.2(3), 4
variation, § 1.5(9) 

W with replacement, § 1.4(6) 
without replacement, § 1.4(6) 




Attributions 

Collection: Collaborative Statistics Using R
Edited by: Ananda Mahto
URL: http://cnx.org/content/col11219/1.7/
License: http://creativecommons.org/licenses/by/3.0/

Module: "Preface"
By: Ananda Mahto
URL: http://cnx.org/content/m35060/1.1/
Pages: 1-2
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Preface to "Collaborative Statistics"
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16026/1.16/

Module: "Introduction"
By: Ananda Mahto
URL: http://cnx.org/content/m35069/1.2/
Page: 3
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Introduction
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16008/1.8/

Module: "Key Terms"
By: Ananda Mahto
URL: http://cnx.org/content/m35062/1.1/
Pages: 3-5
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Key Terms
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16007/1.14/

Module: "Data"
By: Ananda Mahto
URL: http://cnx.org/content/m35064/1.1/
Pages: 5-6
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Data
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16005/1.13/




Module: "Sampling"
By: Ananda Mahto
URL: http://cnx.org/content/m35067/1.1/
Pages: 6-9
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Sampling
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16014/1.14/

Module: "Variation and Critical Evaluation"
By: Ananda Mahto
URL: http://cnx.org/content/m35071/1.1/
Pages: 9-12
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Variation and Critical Evaluation
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16021/1.14/

Module: "Frequency, Relative Frequency, and Cumulative Frequency"
By: Ananda Mahto
URL: http://cnx.org/content/m35072/1.3/
Pages: 12-15
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Frequency, Relative Frequency, and Cumulative Frequency
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16012/1.16/

Module: "Practice"
By: Ananda Mahto
URL: http://cnx.org/content/m35074/1.2/
Pages: 15-17
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Sampling and Data: Practice 1
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16016/1.12/

Module: "Introduction"
By: Ananda Mahto
URL: http://cnx.org/content/m35079/1.1/
Page: 19
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Descriptive Statistics: Introduction
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16300/1.7/




Module: "Stem and Leaf Graphs and Bar Graphs"
By: Ananda Mahto
URL: http://cnx.org/content/m35075/1.2/
Pages: 19-24
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Descriptive Statistics: Stem and Leaf Graphs (Stemplots), Line Graphs and Bar Graphs
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16849/1.8/

Module: "Histograms"
By: Ananda Mahto
URL: http://cnx.org/content/m35077/1.1/
Pages: 24-28
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Descriptive Statistics: Histogram
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16298/1.11/

Module: "Box Plots"
By: Ananda Mahto
URL: http://cnx.org/content/m35078/1.1/
Pages: 28-31
Copyright: Ananda Mahto
License: http://creativecommons.org/licenses/by/3.0/
Based on: Descriptive Statistics: Box Plot
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16296/1.8/



Collaborative Statistics Using R 

Collaborative Statistics Using R is designed to introduce students to basic ideas of statistics while
simultaneously teaching them how to use some of the basic functions in the open-source statistics
project, R. The ultimate goal of this collection is to gradually add as many chapters as possible from
the Collaborative Statistics textbook, with updated instructions on how to use R for the calculations.



About Connexions 

Since 1999, Connexions has been pioneering a global system where anyone can create course materials and 
make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and 
learning environment open to anyone interested in education, including students, teachers, professors and 
lifelong learners. We connect ideas and facilitate educational communities. 

Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 
schools, distance learners, and lifelong learners. Connexions materials are in many languages, including 
English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part 
of an exciting new information distribution system that allows for Print on Demand Books. Connexions 
has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course 
materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers. 



